Learning Check

When you complete this session, you will be able to:

  1. understand variable selection in multiple linear regression (MLR);

  2. divide a dataset into a training set and a test set, so that models can be constructed on the former and their predictive ability assessed on the latter;

  3. calculate internal measures of predictive ability, such as PRESS (the predicted residual sum of squares);

  4. understand multicollinearity.

# Load the libraries used in later code; install them first if they aren't already on your computer.
library(leaps)
library(knitr)
library(corrplot)

Question 1 Boston data

The Boston data frame has 506 rows and 14 columns, containing housing values in suburbs of Boston. This dataset is available in the MASS R package.

The aim is to predict housing values in suburbs of Boston based on 13 explanatory variables in the dataset.

This data frame contains the following columns:

 - crim: per capita crime rate by town.
 - zn: proportion of residential land zoned for lots over 25,000 sq.ft.
 - indus: proportion of non-retail business acres per town.
 - chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
 - nox: nitrogen oxides concentration (parts per 10 million).
 - rm: average number of rooms per dwelling.
 - age: proportion of owner-occupied units built prior to 1940.
 - dis: weighted mean of distances to five Boston employment centres.
 - rad: index of accessibility to radial highways.
 - tax: full-value property-tax rate per $10,000.
 - ptratio: pupil-teacher ratio by town.
 - black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
 - lstat: lower status of the population (percent).
 - medv: median value of owner-occupied homes in $1000s.

Answer the following tasks.

Hint. Use R to retrieve the data as follows. The dataset is available in the MASS R package; install that package if necessary, then run the following.

library(MASS) 
attach(Boston)

head(Boston) # to view the first 6 rows
     crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
  medv
1 24.0
2 21.6
3 34.7
4 33.4
5 36.2
6 28.7
#View(Boston) # uncomment to view the dataset
#str(Boston)  # uncomment to see the structure
  1. Based on the description above, what is the response variable?

The response variable is medv.

  2. Consider the subset of the dataset that consists of numerical variables only (call this Boston1). Provide a scatterplot matrix using all the numerical variables. Briefly comment on the output, stating the two predictors with the highest correlations (positive or negative) with the response.

We know that chas and rad are categorical.

Boston1 <- subset(Boston, select = c(-chas, -rad))
dim(Boston1)
[1] 506  12
library(PerformanceAnalytics)
chart.Correlation(Boston1)  # excluding categorical variable

The variables rm (+0.70) and lstat (-0.74) have the highest correlations with the response.

  3. Before you carry out any model-building, split the dataset Boston1 into a training set and a test set so that the training set contains 80% of the data in Boston1 and the test set contains 20%. Show how you would do that. Call your training and test sets Train.Boston1 and Test.Boston1, respectively.
set.seed(2401)
TestIndex <- sample(nrow(Boston1), floor(0.2*nrow(Boston1)))  
Test.Boston1 <- Boston1[TestIndex, ] 
Train.Boston1 <- Boston1[-TestIndex, ]
  4. Based on your (partial) exploratory analysis in Task 2 (and perhaps additional analysis), you choose only one explanatory variable with which to build a simple linear regression using Train.Boston1.

    i. Which one did you choose and why?

The variable rm, because it has a strong positive correlation (+0.70) with medv.

ii. Fit a simple linear regression with this variable using the training set. Output a summary of the linear model. What is the value of the estimated slope?
model.slr <- lm(medv ~ rm, data = Train.Boston1)
summary(model.slr)

Call:
lm(formula = medv ~ rm, data = Train.Boston1)

Residuals:
    Min      1Q  Median      3Q     Max 
-21.691  -2.664   0.014   3.084  38.740 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -30.9149     3.0108  -10.27   <2e-16 ***
rm            8.4858     0.4771   17.79   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.916 on 403 degrees of freedom
Multiple R-squared:  0.4398,    Adjusted R-squared:  0.4384 
F-statistic: 316.4 on 1 and 403 DF,  p-value: < 2.2e-16

The value of the estimated slope is 8.4858.

iii. Output diagnostic plots and comment on whether you think they indicate 
any inadequacies in your linear model. You only need to comment on the first 
three of these 'standard' diagnostic plots. 
par(mfrow=c(2,2))
plot(model.slr)

The interpretation depends on the training set: comment on what the Residuals vs Fitted, Normal Q-Q and Scale-Location plots suggest about linearity, normality and constant variance.

  5. Use forward stepwise regression to construct a multiple linear regression model for this dataset using the training data (Train.Boston1). Comment on the fitted model, specifically on which predictors are significant at a 5% significance level.
model.all <- lm(medv~., data=Train.Boston1)
summary(model.all)

Call:
lm(formula = medv ~ ., data = Train.Boston1)

Residuals:
    Min      1Q  Median      3Q     Max 
-12.317  -3.028  -0.592   1.969  26.668 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  36.125301   5.835100   6.191 1.51e-09 ***
crim         -0.071607   0.045742  -1.565  0.11828    
zn            0.019456   0.017801   1.093  0.27507    
indus        -0.085737   0.073655  -1.164  0.24511    
nox         -17.974482   4.565558  -3.937 9.76e-05 ***
rm            3.663869   0.478957   7.650 1.57e-13 ***
age           0.007483   0.016062   0.466  0.64157    
dis          -1.570527   0.241110  -6.514 2.25e-10 ***
tax           0.002344   0.002845   0.824  0.41045    
ptratio      -0.925944   0.155434  -5.957 5.70e-09 ***
black         0.008738   0.003218   2.715  0.00691 ** 
lstat        -0.558664   0.060083  -9.298  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.131 on 393 degrees of freedom
Multiple R-squared:  0.6993,    Adjusted R-squared:  0.6909 
F-statistic: 83.08 on 11 and 393 DF,  p-value: < 2.2e-16
model.1 <- lm(medv~1, data=Train.Boston1)
summary(model.1)

Call:
lm(formula = medv ~ 1, data = Train.Boston1)

Residuals:
    Min      1Q  Median      3Q     Max 
-17.287  -5.687  -1.287   2.713  27.713 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  22.2874     0.4586    48.6   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.229 on 404 degrees of freedom
# Forward selection (trace = 0 suppresses the intermediate output)
model.f<-step(model.1, scope = formula(model.all), direction = "forward", trace = 0)

The final model chosen by forward selection is:

summary(model.f)

Call:
lm(formula = medv ~ lstat + rm + ptratio + dis + nox + black, 
    data = Train.Boston1)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.2605  -3.1328  -0.7291   1.8675  26.9514 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  36.848832   5.646546   6.526 2.06e-10 ***
lstat        -0.570857   0.053838 -10.603  < 2e-16 ***
rm            3.792846   0.459828   8.248 2.37e-15 ***
ptratio      -0.996059   0.129305  -7.703 1.07e-13 ***
dis          -1.394089   0.197644  -7.054 7.78e-12 ***
nox         -18.726505   3.857569  -4.854 1.74e-06 ***
black         0.009171   0.003111   2.948  0.00339 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.13 on 398 degrees of freedom
Multiple R-squared:  0.6956,    Adjusted R-squared:  0.691 
F-statistic: 151.6 on 6 and 398 DF,  p-value: < 2.2e-16

The predictors that are significant at the 5% level are: lstat, rm, ptratio, dis, nox and black.

  6. Output diagnostic plots for the model chosen by the forward method in Task 5 and comment on whether you think they indicate any inadequacies in your linear model. You only need to comment on the first three of these 'standard' diagnostic plots.
par(mfrow=c(2,2))
plot(model.f)

The interpretation depends on the training set: comment on the diagnostic plots as in Task 4.

  7. Despite any inadequacies that you may or may not have identified above, use the model to make predictions of medv in the test set (Test.Boston1). Calculate the RMSEP and the PRESS.
# Predictions for forward selection
pred.fwd <- predict(model.f, newdata = Test.Boston1)  ## yhat_i
# Extract actual values from test set
Actual <- Test.Boston1$medv  ## y_i  
# Calculate RMSEP 
M <- length(Actual)
RMSEP <- sqrt(sum((Actual - pred.fwd)^2)/M)
RMSEP
[1] 4.210982
library(DAAG)
press(model.f)
[1] 11064.88

Question 2 Variable Selection: Body Fat Data

  1. Load the data file BodyFat.RData.
print(load("BodyFat.RData"))
[1] "TrainBodyFat1" "TestBodyFat1" 
View(TrainBodyFat1)
View(TestBodyFat1)
dim(TrainBodyFat1)
[1] 221  14
dim(TestBodyFat1)
[1] 24 14
  2. What objects does it contain?

A training set and a test set!

  3. Construct a scatterplot matrix of the training and test data. Compare them and comment.

Note that because there are 14 variables in the dataset, we need to make the scatterplot matrix a bit bigger than the default size, and we can do so by using the chunk options fig.width and fig.height. Note these are knitr/Rmarkdown options, not R code.
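In the .Rmd source, these options go in the chunk header rather than in the R code itself. A minimal sketch (the chunk label `pairsplot` and the sizes are illustrative, not from the original):

```r
# In the .Rmd source, the chunk header (not R code) would read:
#   ```{r pairsplot, fig.width = 9, fig.height = 9}
# and the plotting commands for the scatterplot matrix go inside that chunk.
```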

# Training set
pairs(TrainBodyFat1)

library(PerformanceAnalytics)
chart.Correlation(TrainBodyFat1)

pairs(TestBodyFat1)

It’s a good idea to plot the test set as well, because we want to make sure that there aren’t any outliers that might inflate prediction error.

In the training set, it’s clear that all of the variables are highly correlated - not surprising considering that body measurements tend to go up and down together.

We can see that in the training set there are a couple of individuals who have very large ankle measurements relative to other body measurements. No such outliers appear to exist in the test set.

  4. Remove the two obvious outliers from the training set.

There are lots of ways of removing the two individuals outlined above, but a convenient way is to use the function subset. See the help file for more details on how to use it. The two individuals to remove have ankle measurements of greater than 30 cm, and we will use this information to keep only those individuals whose ankle measurements are less than 30 cm.

dim(TrainBodyFat1) # dimension of original training set
[1] 221  14
TrainBodyFat1 <- subset(TrainBodyFat1, Ankle < 30)
dim(TrainBodyFat1) # dimension of training set with two observations removed
[1] 219  14
  5. Using the training set, construct a model with only an intercept and a model with all the variables.

The reason we're going to construct the simplest model possible, and the model containing all of the variables, is so that we can use them in the simplest of all sequential selection methods, forward stepwise regression. Have a look at the lecture notes for a fuller explanation of how it works.

# Model with only intercept
lm0 <- lm(bodyfat ~ 1, data = TrainBodyFat1)
# Model with all variables
lmall <- lm(bodyfat ~ ., data = TrainBodyFat1)
summary(lmall)

Call:
lm(formula = bodyfat ~ ., data = TrainBodyFat1)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.1718  -3.0671   0.0207   3.0749   9.5016 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.99231   26.11482   0.153  0.87865    
Age          0.06523    0.03595   1.815  0.07104 .  
Weight      -0.02598    0.07343  -0.354  0.72385    
Height      -0.28600    0.20980  -1.363  0.17432    
Neck        -0.39758    0.25827  -1.539  0.12525    
Chest       -0.09845    0.11844  -0.831  0.40679    
Abdomen      0.90690    0.10092   8.986  < 2e-16 ***
Hip         -0.16953    0.16045  -1.057  0.29195    
Thigh        0.20533    0.15989   1.284  0.20053    
Knee        -0.09147    0.27593  -0.331  0.74061    
Ankle        0.13236    0.40734   0.325  0.74556    
Biceps       0.18373    0.18570   0.989  0.32362    
Forearm      0.28940    0.21856   1.324  0.18694    
Wrist       -1.69147    0.62595  -2.702  0.00746 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.353 on 205 degrees of freedom
Multiple R-squared:  0.7339,    Adjusted R-squared:  0.717 
F-statistic: 43.49 on 13 and 205 DF,  p-value: < 2.2e-16
  6. Use forward stepwise regression to select a subset of the variables.

Once we have constructed the models above, we can use them in the function for running forward stepwise selection. The argument trace = 1 prints the intermediate output (trace = 0 would suppress it).

lmfwd <- step(lm0, scope = formula(lmall), direction = "forward", trace = 1)
Start:  AIC=921.71
bodyfat ~ 1

          Df Sum of Sq     RSS    AIC
+ Abdomen  1    9590.3  5008.0 689.41
+ Chest    1    6838.2  7760.1 785.32
+ Hip      1    5397.3  9201.1 822.62
+ Weight   1    5044.8  9553.5 830.85
+ Thigh    1    3850.0 10748.3 856.66
+ Neck     1    3359.1 11239.3 866.44
+ Knee     1    3191.5 11406.9 869.68
+ Biceps   1    2932.0 11666.3 874.61
+ Wrist    1    1669.5 12928.9 897.11
+ Forearm  1    1590.4 13007.9 898.45
+ Age      1    1561.6 13036.7 898.93
+ Ankle    1    1047.4 13550.9 907.41
<none>                 14598.3 921.71
+ Height   1      73.6 14524.8 922.60

Step:  AIC=689.41
bodyfat ~ Abdomen

          Df Sum of Sq    RSS    AIC
+ Weight   1    714.23 4293.8 657.71
+ Height   1    581.33 4426.7 664.39
+ Wrist    1    573.39 4434.6 664.78
+ Neck     1    399.27 4608.7 673.21
+ Ankle    1    379.09 4628.9 674.17
+ Hip      1    356.78 4651.2 675.22
+ Knee     1    294.89 4713.1 678.12
+ Chest    1    198.06 4810.0 682.57
+ Age      1    154.17 4853.8 684.56
+ Thigh    1    123.85 4884.2 685.93
+ Forearm  1    101.28 4906.7 686.93
+ Biceps   1    100.30 4907.7 686.98
<none>                 5008.0 689.41

Step:  AIC=657.71
bodyfat ~ Abdomen + Weight

          Df Sum of Sq    RSS    AIC
+ Wrist    1   146.804 4147.0 652.09
+ Thigh    1    70.001 4223.8 656.11
+ Neck     1    51.396 4242.4 657.07
+ Biceps   1    50.410 4243.4 657.13
+ Height   1    47.118 4246.7 657.30
<none>                 4293.8 657.71
+ Forearm  1    22.411 4271.4 658.57
+ Ankle    1     4.656 4289.1 659.47
+ Chest    1     1.727 4292.1 659.62
+ Age      1     1.214 4292.6 659.65
+ Hip      1     0.715 4293.1 659.68
+ Knee     1     0.000 4293.8 659.71

Step:  AIC=652.09
bodyfat ~ Abdomen + Weight + Wrist

          Df Sum of Sq    RSS    AIC
+ Biceps   1    70.917 4076.1 650.32
+ Height   1    52.325 4094.7 651.31
+ Forearm  1    51.997 4095.0 651.33
<none>                 4147.0 652.09
+ Thigh    1    35.113 4111.9 652.23
+ Age      1    22.963 4124.0 652.88
+ Neck     1     8.914 4138.1 653.62
+ Hip      1     3.772 4143.2 653.89
+ Ankle    1     2.619 4144.4 653.95
+ Knee     1     1.872 4145.1 653.99
+ Chest    1     0.194 4146.8 654.08

Step:  AIC=650.32
bodyfat ~ Abdomen + Weight + Wrist + Biceps

          Df Sum of Sq    RSS    AIC
<none>                 4076.1 650.32
+ Age      1   25.9071 4050.2 650.92
+ Height   1   25.7919 4050.3 650.93
+ Neck     1   24.2258 4051.8 651.01
+ Forearm  1   22.8574 4053.2 651.08
+ Thigh    1   15.1465 4060.9 651.50
+ Ankle    1    7.8960 4068.2 651.89
+ Hip      1    3.2417 4072.8 652.14
+ Knee     1    1.9689 4074.1 652.21
+ Chest    1    1.8034 4074.3 652.22
summary(lmfwd)

Call:
lm(formula = bodyfat ~ Abdomen + Weight + Wrist + Biceps, data = TrainBodyFat1)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.0819  -3.1413  -0.1155   3.3936   9.1313 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -32.55819    7.62550  -4.270 2.94e-05 ***
Abdomen       0.97858    0.05909  16.560  < 2e-16 ***
Weight       -0.12828    0.02963  -4.330 2.29e-05 ***
Wrist        -1.42443    0.48061  -2.964  0.00338 ** 
Biceps        0.31347    0.16245   1.930  0.05498 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.364 on 214 degrees of freedom
Multiple R-squared:  0.7208,    Adjusted R-squared:  0.7156 
F-statistic: 138.1 on 4 and 214 DF,  p-value: < 2.2e-16

Remarkably, only four measurements are selected by the procedure. But let’s see how well it predicts body fat.

  7. Generate predictions from the forward model and from the model with all variables, and compare their RMSEP.
# Predictions from the model with all variables
allpred <- predict(lmall, newdata = TestBodyFat1)

# Predictions from the model selected by forward selection
fwdpred <- predict(lmfwd, newdata = TestBodyFat1)
Actual <- TestBodyFat1$bodyfat
RMSEP.all <- sqrt(sum((Actual - allpred)^2)/length(Actual)); RMSEP.all
[1] 3.391678
RMSEP.fwd <- sqrt(sum((Actual - fwdpred)^2)/length(Actual)); RMSEP.fwd
[1] 3.190346

The key message here is that more variables in a model don't necessarily lead to better predictions: the model with four variables yields a smaller RMSEP than the model with many more variables.

Question 3 Prostate cancer data (Hastie et al., 2009, Chapter 1).

The prostate cancer data (Stamey et al., 1989) examine the correlation between the level of prostate-specific antigen (PSA) and a number of clinical measures in 97 men who were about to receive a radical prostatectomy. The dataset is prostate.txt.

The goal is to predict the log of PSA lpsa from a number of measurements including log cancer volume lcavol, log prostate weight lweight, age, log of benign prostatic hyperplasia amount lbph, seminal vesicle invasion svi, log of capsular penetration lcp, Gleason score gleason, and percent of Gleason scores 4 or 5 pgg45.

  1. Import step: read the dataset into R and check the data structure.
library(dplyr)

Loading the dataset.

options(digits=3) ## print to 3 significant digits

prostate_data=read.table(file='prostate.txt',header=TRUE) # Read the data in.
head(prostate_data)
  Men lcavol lweight age  lbph svi   lcp gleason pgg45   lpsa train
1   1 -0.580    2.77  50 -1.39   0 -1.39       6     0 -0.431  TRUE
2   2 -0.994    3.32  58 -1.39   0 -1.39       6     0 -0.163  TRUE
3   3 -0.511    2.69  74 -1.39   0 -1.39       7    20 -0.163  TRUE
4   4 -1.204    3.28  58 -1.39   0 -1.39       6     0 -0.163  TRUE
5   5  0.751    3.43  62 -1.39   0 -1.39       6     0  0.372  TRUE
6   6 -1.050    3.23  50 -1.39   0 -1.39       6     0  0.765  TRUE
str(prostate_data)
'data.frame':   97 obs. of  11 variables:
 $ Men    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ lcavol : num  -0.58 -0.994 -0.511 -1.204 0.751 ...
 $ lweight: num  2.77 3.32 2.69 3.28 3.43 ...
 $ age    : int  50 58 74 58 62 50 64 58 47 63 ...
 $ lbph   : num  -1.39 -1.39 -1.39 -1.39 -1.39 ...
 $ svi    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ lcp    : num  -1.39 -1.39 -1.39 -1.39 -1.39 ...
 $ gleason: int  6 6 7 6 6 6 6 6 6 6 ...
 $ pgg45  : int  0 0 20 0 0 0 0 0 0 0 ...
 $ lpsa   : num  -0.431 -0.163 -0.163 -0.163 0.372 ...
 $ train  : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
View(prostate_data)
  2. Wrangling step: we don't need the ID variable (Men), and we define the train and test subsets of the data.
prostate_data=prostate_data%>%   # remove the ID variable
  dplyr::select(-Men) 

prostate_data_train=prostate_data%>%   # keep only the training rows for modelling
  filter(train==TRUE)

prostate_data_test=prostate_data%>%   # keep only the test rows for prediction
  filter(train==FALSE)
dim(prostate_data_train)
[1] 67 10
dim(prostate_data_test)
[1] 30 10
head(prostate_data_train)
  lcavol lweight age  lbph svi   lcp gleason pgg45   lpsa train
1 -0.580    2.77  50 -1.39   0 -1.39       6     0 -0.431  TRUE
2 -0.994    3.32  58 -1.39   0 -1.39       6     0 -0.163  TRUE
3 -0.511    2.69  74 -1.39   0 -1.39       7    20 -0.163  TRUE
4 -1.204    3.28  58 -1.39   0 -1.39       6     0 -0.163  TRUE
5  0.751    3.43  62 -1.39   0 -1.39       6     0  0.372  TRUE
6 -1.050    3.23  50 -1.39   0 -1.39       6     0  0.765  TRUE
  3. Explore step: produce a scatterplot matrix of the variables in the dataset, as in Figure 1.1 of Hastie et al. (2009), Chapter 1.
pairs(prostate_data[,1:9])

# a more informative graph
library("PerformanceAnalytics")
chart.Correlation(prostate_data[,1:9],histogram=TRUE,pch=19)

We can also create the scatterplot matrix using ggplot2 and GGally R packages.

library(dplyr)
library(ggplot2)
library(GGally)
ggpairs(prostate_data, columns=1:9)

We can see that lcavol, lweight and age appear to be positively correlated with lpsa.

  4. Obtain the correlation matrix of the data.
cor(prostate_data[1:9])
        lcavol lweight   age    lbph     svi    lcp gleason  pgg45  lpsa
lcavol  1.0000  0.2805 0.225  0.0273  0.5388  0.675  0.4324 0.4337 0.734
lweight 0.2805  1.0000 0.348  0.4423  0.1554  0.165  0.0569 0.1074 0.433
age     0.2250  0.3480 1.000  0.3502  0.1177  0.128  0.2689 0.2761 0.170
lbph    0.0273  0.4423 0.350  1.0000 -0.0858 -0.007  0.0778 0.0785 0.180
svi     0.5388  0.1554 0.118 -0.0858  1.0000  0.673  0.3204 0.4576 0.566
lcp     0.6753  0.1645 0.128 -0.0070  0.6731  1.000  0.5148 0.6315 0.549
gleason 0.4324  0.0569 0.269  0.0778  0.3204  0.515  1.0000 0.7519 0.369
pgg45   0.4337  0.1074 0.276  0.0785  0.4576  0.632  0.7519 1.0000 0.422
lpsa    0.7345  0.4333 0.170  0.1798  0.5662  0.549  0.3690 0.4223 1.000
  5. As the variables are on differing scales, apply standardisation by mean and standard deviation. Do not standardise the variable lpsa (the response).

This is important when predictors are on differing scales. We standardise by mean and standard deviation, but we do NOT scale the response variable. In larger datasets, standardisation also speeds up learning algorithms.

First we exclude lpsa and reorder the variables to be the same order as the reference.

predictors_train=prostate_data_train[ ,c('lcavol','lweight','age','lbph','svi','lcp','gleason','pgg45')]

predictors_test=prostate_data_test[ ,c('lcavol','lweight','age','lbph','svi','lcp','gleason','pgg45')]

# The scale function standardises each column by its mean and standard deviation
predictors_scaled=as.data.frame(scale(predictors_train))
prostate_data_train=data.frame(prostate_data_train$lpsa,predictors_scaled)
names(prostate_data_train)=c('lpsa','lcavol','lweight','age','lbph','svi','lcp','gleason','pgg45')

predictors_scaled_test=as.data.frame(scale(predictors_test))
prostate_data_test=data.frame(prostate_data_test$lpsa,predictors_scaled_test)
names(prostate_data_test)=c('lpsa','lcavol','lweight','age','lbph','svi','lcp','gleason','pgg45')
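A common alternative in practice (a sketch, not what the steps above do) is to standardise the test predictors using the training means and standard deviations, so that both sets are on exactly the same scale:

```r
# Sketch: scale the test set with the *training* means and SDs.
# Assumes predictors_train and predictors_test as defined above.
mu  <- colMeans(predictors_train)
sdv <- apply(predictors_train, 2, sd)
predictors_test_alt <- as.data.frame(scale(predictors_test, center = mu, scale = sdv))
```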
  6. Use the best-subset selection method to select a subset of the variables.
library(caret)
library(leaps)
best_subsets <- regsubsets(lpsa~., data = prostate_data_train, nvmax = 8,nbest=1)
summary(best_subsets)
Subset selection object
Call: regsubsets.formula(lpsa ~ ., data = prostate_data_train, nvmax = 8, 
    nbest = 1)
8 Variables  (and intercept)
        Forced in Forced out
lcavol      FALSE      FALSE
lweight     FALSE      FALSE
age         FALSE      FALSE
lbph        FALSE      FALSE
svi         FALSE      FALSE
lcp         FALSE      FALSE
gleason     FALSE      FALSE
pgg45       FALSE      FALSE
1 subsets of each size up to 8
Selection Algorithm: exhaustive
         lcavol lweight age lbph svi lcp gleason pgg45
1  ( 1 ) "*"    " "     " " " "  " " " " " "     " "  
2  ( 1 ) "*"    "*"     " " " "  " " " " " "     " "  
3  ( 1 ) "*"    "*"     " " " "  "*" " " " "     " "  
4  ( 1 ) "*"    "*"     " " "*"  "*" " " " "     " "  
5  ( 1 ) "*"    "*"     " " "*"  "*" " " " "     "*"  
6  ( 1 ) "*"    "*"     " " "*"  "*" "*" " "     "*"  
7  ( 1 ) "*"    "*"     "*" "*"  "*" "*" " "     "*"  
8  ( 1 ) "*"    "*"     "*" "*"  "*" "*" "*"     "*"  

The best possible subset containing two predictors comprises lcavol and lweight. (To choose between subset sizes, one option is the "one standard error" rule: take the simplest model whose error is within one standard error of that of the best model.)

Note that the stars indicate which variables are included in the best subset of each size. The argument nbest returns the best subset(s) of each size up to nvmax.
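As an aside, the coefficients of the best subset of a given size can be extracted directly from the regsubsets object, rather than refitting by hand (a sketch, assuming the best_subsets object fitted above):

```r
# Coefficients of the best 2-predictor model found by regsubsets
coef(best_subsets, id = 2)
```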

linear_best_subset=lm(lpsa~lcavol+lweight,data=prostate_data_train)
summary(linear_best_subset)

Call:
lm(formula = lpsa ~ lcavol + lweight, data = prostate_data_train)

Residuals:
   Min     1Q Median     3Q    Max 
-1.589 -0.442  0.013  0.526  1.931 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.4523     0.0930   26.37  < 2e-16 ***
lcavol        0.7799     0.0982    7.94  4.1e-11 ***
lweight       0.3519     0.0982    3.58  0.00066 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.761 on 64 degrees of freedom
Multiple R-squared:  0.615, Adjusted R-squared:  0.603 
F-statistic: 51.1 on 2 and 64 DF,  p-value: 5.54e-14
  7. Using the training set, construct a model with only an intercept. Then calculate its residual sum of squares (RSS).
# We need to calculate the RSS for the intercept-only model first (RSS_intercept)

prostate_intercept=lm(lpsa~1,data=prostate_data_train)

intercept_pred=predict(prostate_intercept,newdata=prostate_data_train)
RSS_intercept=sum((intercept_pred-prostate_data_train$lpsa)^2)
RSS_intercept
[1] 96.3
  8. Plot the RSS against the subset size \(k=\{0,1,\ldots,8\}\).
plot(x=0:8,c(RSS_intercept,summary(best_subsets)$rss),xlim=c(0,8),ylim=c(0,100),xlab="Subset Size k",ylab="Residual Sum-of-Squares",col='red')
lines(x=0:8,c(RSS_intercept,summary(best_subsets)$rss),col='red')

9. Re-run regsubsets with nbest = 1, and then construct plots of adjusted \(R^2\), \(C_p\), and BIC against number of variables. You will need to extract these criteria from the summary object.

# Modify some graphical parameters
par(mfrow = c(1, 3))
par(cex.axis = 1.5)
par(cex.lab = 1.5)

AllSubsets <- regsubsets(lpsa~., data = prostate_data_train, nvmax = 8,nbest=1)
AllSubsets.summary <- summary(AllSubsets)
plot(1:8, AllSubsets.summary$adjr2, xlab = "subset size", ylab = "adjusted R-squared", type = "b")
plot(1:8, AllSubsets.summary$cp, xlab = "subset size", ylab = "Mallows' Cp", type = "b")
abline(0,1,col=2)
plot(1:8, AllSubsets.summary$bic, xlab = "subset size", ylab = "BIC", type = "b")

This now gives the “best” choice of explanatory variables for models of size 1,2,…8. So out of all models with only a single covariate, we would choose (“lcavol”) as the best predictor.

Adjusted \(R^2\) recommends \(p=7\); BIC recommends 2, 3 or 4.

For Mallows' \(C_p\), you can add a reference line to help locate the subset size that satisfies \(C_p \approx p\). From the above plot, it is around 7.

Question 4 PRESS for Bodyfat data (Question 2)

As we saw in Lecture 10, one internal measure of predictive ability of a model is the PRESS statistic. The idea is to set aside a single observation \(y_i\) from the dataset and, using the model constructed from the remaining data, predict the value of the observation left out (\(\hat{y}_{-i}\)) and calculate the residual \(y_i - \hat{y}_{-i}\). Do this until every observation has been left out, and then calculate the sum of the squares of these residuals, i.e.,

\[ \mbox{PRESS} = \sum_{i=1}^n (y_i - \hat{y}_{-i})^2 \]

As we saw in Lecture 10, for a linear regression, calculating \(\mbox{PRESS}\) doesn't actually require this iterative procedure. Instead, \(\mbox{PRESS}\) can be calculated more easily using values computed from fitting the model to all the data, i.e.,

\[ \mbox{PRESS} = \sum_{i=1}^n \left(\frac{\hat{e}_i}{1 - h_{ii}}\right)^2 \]

In the expression above, \(\hat{e}_i\) is the \(i\)th residual, and \(h_{ii}\) is the \(i\)th diagonal element of the 'hat' matrix \(H = X(X'X)^{-1}X'\).
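The shortcut formula is straightforward to code directly. A minimal sketch using the built-in mtcars data (an illustration only, not the body fat models from Question 2):

```r
# PRESS via the leverage shortcut: sum((e_i / (1 - h_ii))^2)
fit <- lm(mpg ~ wt + hp, data = mtcars)
e <- residuals(fit)    # ordinary residuals e_i
h <- hatvalues(fit)    # diagonal elements h_ii of the hat matrix
PRESS <- sum((e / (1 - h))^2)
PRESS
```

Because each \(1 - h_{ii} < 1\), PRESS is always at least as large as the ordinary residual sum of squares.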

The library DAAG (Maindonald, J.H. and Braun, W.J. (2010) Data Analysis and Graphics Using R, 3rd ed. Cambridge: Cambridge University Press) contains the function press that you can use instead of coding the explicit expressions above.

If you do not have DAAG installed, you have to install it first, and then load it into your R session using the command library(DAAG) before you can invoke press.

Compare the value of the \(\mbox{PRESS}\) statistic for the forward and all-variables models you constructed in Question 2: lmfwd and lmall.

# Install DAAG first, using Tools -> Install Packages ...
# Then load it here:
# library(DAAG)
# Calculate PRESS for forward selection
press(lmfwd)
[1] 4269
# Calculate PRESS for ALL  
press(lmall)
[1] 4403

Question 5 Sheather (2009): Modelling defective rates

We revisit this dataset to include quadratic terms in the model.

The data frame Defects provides data on the average number of defects per 1000 parts (Defective) produced in an industrial process along with the values of other variables (Temperature, Density, and Rate). The production engineer wishes to construct a linear model relating Defective to the potential predictors.

  1. Use the pairs function to produce a scatterplot matrix of all the variables in Defects. Are the scatterplots of Defective against the other variables linear?
defects=read.table("defects.txt",header=T)
defects=defects[,-1]
View(defects)
# Scatterplot
pairs(defects)

library(PerformanceAnalytics)
chart.Correlation(defects)

From the scatterplots, we can see that Defective has a strong positive relationship with each of the other variables, although there is visible curvature to these relationships, suggesting they are not linear (whereas the relationships among Temperature, Density and Rate themselves appear linear).
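Since the question mentions quadratic terms, the curvature could be addressed by adding squared terms via I(). A sketch (assuming the defects data frame loaded above; the choice of Temperature is illustrative):

```r
# Sketch: allow curvature with a quadratic term in Temperature
def.lm.quad <- lm(Defective ~ Temperature + I(Temperature^2), data = defects)
summary(def.lm.quad)
```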

  2. Develop a regression model that directly predicts the number of defects using a subset or all of the potential predictor variables listed above.

Explain the fitted model, by interpreting the R output. We have done this in Comp Lab Week 10.

def.lm=lm(Defective~Temperature+Density+Rate, data=defects)
summary(def.lm)

Call:
lm(formula = Defective ~ Temperature + Density + Rate, data = defects)

Residuals:
    Min      1Q  Median      3Q     Max 
-12.737  -4.112  -0.575   2.762  16.328 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   10.324     65.926    0.16    0.877  
Temperature   16.078      8.294    1.94    0.063 .
Density       -1.827      1.497   -1.22    0.233  
Rate           0.117      0.131    0.89    0.380  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.11 on 26 degrees of freedom
Multiple R-squared:  0.88,  Adjusted R-squared:  0.866 
F-statistic: 63.4 on 3 and 26 DF,  p-value: 4.37e-12
  3. Multicollinearity can be detected using a cut-off on the VIF (variance inflation factor). Correlation amongst the predictors increases the variance of the estimated regression coefficients:

\[var(\hat{\beta_j})=\frac{1}{1-r_{12}^2} \frac{\sigma^2}{S_{x_j}^2}\] where \(r_{12}\) denotes the correlation between \(x_1\) and \(x_2\), and \(S_{x_j}\) denotes the standard deviation of \(x_j\). The term \(\frac{1}{1-r_{12}^2}\) is called the variance inflation factor (VIF).
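More generally, \(VIF_j = 1/(1 - R_j^2)\), where \(R_j^2\) is the R-squared from regressing \(x_j\) on the other predictors. A minimal sketch using the built-in mtcars data (an illustration, not the defects model):

```r
# VIF of wt in the model mpg ~ wt + hp + disp, computed by hand:
# regress wt on the other predictors and take 1 / (1 - R^2)
r2_wt <- summary(lm(wt ~ hp + disp, data = mtcars))$r.squared
vif_wt <- 1 / (1 - r2_wt)
vif_wt
```

This agrees with the value returned by vif from the car package for the same model.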

There are high correlations among explanatory variables as shown in (a), which may be related to multicollinearity.

The variance inflation factors can be calculated using the function vif in the car R package.

library(car)
vif(def.lm)
Temperature     Density        Rate 
      13.43       14.51        6.64 

A number of these variance inflation factors exceed 5, a commonly used cut-off, so the associated regression coefficients are poorly estimated due to multicollinearity.